library(qtlcharts)
library(CountClust)
library(parallel)
library(cellcycleR)
library(data.table)
library(binhf)
library(vioplot)
library(limma)
library(readxl)
In this script, we perform the cellcycleR method on the iPSC cells data collected by Po Yuan in Yoav’s lab. The data is still not publicly available, but you can check out John Blischak’s website for the information on the data collection, the preliminary analysis Yoav’a lab have been focussing on so far. For this data, unlike for Oscope and the Marioni data, the other two datasets we are looking at, we do not have any information on the cell phases. The cell phases were estimated using a very ad-hoc approach by Macosko et al, see their paper here.
We have batch corected and individual corrected the expression levels for the iPSC samples. Also, we have fitted admixture models to learn about the distinctive behavior across the different phases. Check our webpage for further details.
We now load the batch corrected gene expression data for all genes (with single cell samples along the rows and the genes on the columns).
setwd('/Users/kushal/Documents/singleCell-method/project/analysis')
molecules_single <- data.frame(fread("../data/batch_removed_counts_all_genes.txt"), row.names = 1);
cycle_counts_data <- molecules_single;
dim(cycle_counts_data)
## [1] 578 17084
Next, we perform voom to log normalize the gene expression across the cells and then for each gene, we perform mean correction and standardization. then we filter out the genes with 0 normalized expression across all the single cells.
cycle_voom_data <- voom(cycle_counts_data)$E;
cycle_data_norm <- apply(cycle_voom_data,2,function(x) return (x-mean(x))/sd(x))
celltime_levels <- 100;
cycle_data_norm <- cycle_data_norm[, -which(colSums(cycle_data_norm)==0)]
dim(cycle_data_norm)
## [1] 578 16860
The cell phases as assigned using Macosko method
cell_phases <- as.vector(as.matrix(read.table("../data/cell_phase_vector_yoav.txt")));
We next apply the cellcycleR method on the entire Yoav data.
out <- cell_ordering_class(cycle_data_norm, celltime_levels = 100, num_iter=50, save_path="../rdas/cell_order_ipsc_full.rda")
We ran the method above once already (took around 30 minutes) and now we just load the output.
out <- get(load(file="../rdas/cell_order_ipsc_full.rda"));
cell_order_full <- cell_ordering_full(out$signal_intensity, dim(cycle_data_norm)[2])
We look at the plots of the amplitudes, phases and the non signal variation of the genes.
amp_genes <- out$amp;
sd_genes <- out$sigma;
phi_genes <- out$phi;
plot(density(phi_genes), col="red", main="Density plot of the phases")
plot(density(amp_genes[-which.max(amp_genes)]), col="red", main="Density plot of the amplitudes")
plot(density(sd_genes[-which.max(sd_genes)]), col="red", main="Density plot of the non-signal sd")
vioplot(cell_order_full[which(cell_phases=="G1.S")],
cell_order_full[which(cell_phases=="S")],
cell_order_full[which(cell_phases=="G2.M")],
cell_order_full[which(cell_phases=="M")],
cell_order_full[which(cell_phases=="M.G1")],
names=c("G1","S","G2M","M","M.G1"),
col="red")
We extract the genes with high SNR - these are more likely to be sinusoidal.
ESS <- amp_genes^2; RSS <- sd_genes^2
SNR <- ESS/RSS;
plot(SNR, col="red", pch=20, lwd=1)
top_genes <- which(SNR > 10);
Next we plot the qtlcharts for these top sinusoidal genes and see if their patterns are indeed sinusoidal or not.
iplotCurves(t(cycle_data_norm[order(cell_order_full),top_genes]))
## Set screen size to height=700 x width=1000
We filter out the highly sinusoidal genes and repeat the procedure again.
snr_high_indices <- which(SNR > 1);
cycle_data_norm_sinusoidal <- cycle_data_norm[,snr_high_indices];
dim(cycle_data_norm_sinusoidal)
## [1] 578 2437
We apply the cell ordering mechanism (it takes around 5 minutes to run)
out2 <- cell_ordering_class(cycle_data_norm_sinusoidal, celltime_levels = 100, num_iter=100,
save_path="../rdas/cell_order_ipsc_sinusoidal_full.rda")
We reload the output
out2 <- get(load(file="../rdas/cell_order_ipsc_sinusoidal_full.rda"));
cell_order_full <- cell_ordering_full(out2$signal_intensity, dim(cycle_data_norm_sinusoidal)[2])
We plot the same features as above and check for the robustness. We needed to shift the cell order so as to compare with previous plot on all genes as the method is non-identifiable upto a rotation.
amp_genes <- out2$amp;
sd_genes <- out2$sigma;
phi_genes <- out2$phi;
plot(density(phi_genes), col="red", main="Density plot of the phases")
plot(density(amp_genes[-which.max(amp_genes)]), col="red", main="Density plot of the amplitudes")
plot(density(sd_genes[-which.max(sd_genes)]), col="red", main="Density plot of the non-signal sd")
ESS <- amp_genes^2; RSS <- sd_genes^2
SNR <- ESS/RSS;
plot(SNR, col="red", pch=20, lwd=1)
top_genes <- which(SNR > 3);
new_cell_order <- shift(order(cell_order_full),100,dir="right")
iplotCurves(t(cycle_data_norm_sinusoidal[new_cell_order,top_genes]))
This looks very sinusoidal, there also do not seem to be outliers screwing up the analysis.
vioplot(cell_order_full[which(cell_phases=="G1.S")],
cell_order_full[which(cell_phases=="S")],
cell_order_full[which(cell_phases=="G2.M")],
cell_order_full[which(cell_phases=="M")],
cell_order_full[which(cell_phases=="M.G1")],
names=c("G1","S","G2M","M","M.G1"),
col="red")
It seems that our method seems to mainly distinguish the G2.M and M phases from the other phases, which is again plausible because in stem cells, we mainly have the S (synthesis) and the mitosis (M) phase.